

Review for NeurIPS paper: Bayesian filtering unifies adaptive and non-adaptive neural network optimization methods

Neural Information Processing Systems

Summary and Contributions: Post-rebuttal: Dear authors, thank you for your detailed response and for offering to fix many of the points we raised. I would like to sum up my thoughts after having read the other reviews and your rebuttal. On a high level, the following aspects were most significant in how I arrived at my final score: 1) The perspective is novel, and has interesting potential. Re 1: I think we all agree that this is a pro for the paper and should be considered its main strength. Re 2: Questioning the approximations is a valid point. However, as you argue, you provided sufficient empirical evidence for the mini-batch Gaussianity, and I think that Gaussianity is often assumed without further justification in other Bayesian inference applications as well, simply to keep the computations tractable. Even if the assumptions are not fully realistic, they seem to be "less concerning than those in past work" (rebuttal, line 19).


Review for NeurIPS paper: Bayesian filtering unifies adaptive and non-adaptive neural network optimization methods

Neural Information Processing Systems

After a discussion with the reviewers, I converged towards recommending acceptance of this submission. The reviewers raised the following aspects: 1) The perspective is novel, and has interesting potential. Re 1: All reviewers agree that this is a pro for the paper and should be considered its main strength. The authors agree (rebuttal, lines 23-25). Re 2: R3 believes that questioning the approximations is a valid point. However, as the authors argue, they have provided sufficient empirical evidence for mini-batch Gaussianity in appendix B, and Gaussianity is sometimes assumed without further justification in other Bayesian inference applications as well, simply to keep the computations tractable.


Bayesian filtering unifies adaptive and non-adaptive neural network optimization methods

Neural Information Processing Systems

We formulate the problem of neural network optimization as Bayesian filtering, where the observations are backpropagated gradients. While neural network optimization has previously been studied using natural gradient methods, which are closely related to Bayesian inference, those methods were unable to recover standard optimizers such as Adam and RMSprop with a root-mean-square gradient normalizer, instead yielding a mean-square normalizer. To recover the root-mean-square normalizer, we find it necessary to account for the temporal dynamics of all the other parameters as they are optimized. The resulting optimizer, AdaBayes, adaptively transitions between SGD-like and Adam-like behaviour, automatically recovers AdamW, a state-of-the-art variant of Adam with decoupled weight decay, and has generalisation performance competitive with SGD.
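The distinction the abstract draws between a root-mean-square and a mean-square gradient normalizer can be made concrete with a minimal sketch. The snippet below is not the authors' AdaBayes implementation; it simply contrasts an Adam/RMSprop-style step, which divides the gradient by sqrt(E[g^2]), with a natural-gradient-style step, which divides by E[g^2] without the square root. Function names and the toy values are illustrative assumptions.

```python
import numpy as np

def rms_normalized_step(g, v, lr=1e-3, beta2=0.999, eps=1e-8):
    """Adam/RMSprop-style step: gradient divided by a
    ROOT-mean-square normalizer, sqrt of the running mean of g^2."""
    v = beta2 * v + (1 - beta2) * g**2
    return lr * g / (np.sqrt(v) + eps), v

def mean_square_normalized_step(g, v, lr=1e-3, beta2=0.999, eps=1e-8):
    """Natural-gradient-style step: gradient divided by a
    MEAN-square normalizer, the running mean of g^2 (no square root)."""
    v = beta2 * v + (1 - beta2) * g**2
    return lr * g / (v + eps), v

# Toy illustration on a single parameter with one small gradient.
g = 0.1  # a backpropagated gradient
step_rms, _ = rms_normalized_step(g, v=0.0)
step_ms, _ = mean_square_normalized_step(g, v=0.0)
print(step_rms, step_ms)
```

Because the running mean-square is typically much smaller than 1, dividing by it without the square root produces far larger (and differently scaled) steps than the root-mean-square normalization used by Adam and RMSprop, which is why recovering the square root matters for matching standard optimizers.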